METHOD FOR CONTROLLING DURATION IN SPEECH SYNTHESIS

The present invention relates to the field of speech processing, and more
particularly, without limitation, to the field of text-to-speech synthesis.

The function of a text-to-speech (TTS) synthesis system is to synthesize
speech from a generic text in a given language. Nowadays, TTS systems have been put into
practical operation for many applications, such as access to databases through the telephone
network or aid to handicapped people. One method to synthesize speech is by concatenating
elements of a recorded set of subunits of speech such as demi-syllables or polyphones. The
majority of successful commercial systems employ the concatenation of polyphones. The
polyphones comprise groups of two (diphones), three (triphones) or more phones and may be
determined from nonsense words, by segmenting the desired grouping of phones at stable
spectral regions. In a concatenation based synthesis, the conservation of the transition
between two adjacent phones is crucial to assure the quality of the synthesized speech. With
the choice of polyphones as the basic subunits, the transition between two adjacent phones is
preserved in the recorded subunits, and the concatenation is carried out between similar
phones. Before the synthesis, however, the phones must have their duration and pitch
modified in order to fulfil the prosodic constraints of the new words containing those phones.
This processing is necessary to avoid the production of a monotonous sounding synthesized
speech. In a TTS system, this function is performed by a prosodic module. To allow the
duration and pitch modifications in the recorded subunits, many concatenation based TTS
systems employ the time-domain pitch-synchronous overlap-add (TD-PSOLA) (E. Moulines
and F. Charpentier, "Pitch synchronous waveform processing techniques for text-to-speech
synthesis using diphones," Speech Commun., vol. 9, pp. 453-467, 1990) model of synthesis.
In the TD-PSOLA model, the speech signal is first submitted to a pitch marking algorithm.
This algorithm assigns marks at the peaks of the signal in the voiced segments and assigns
marks 10 ms apart in the unvoiced segments. The synthesis is made by a superposition of
Hanning windowed segments centered at the pitch marks and extending from the previous
pitch mark to the next one. The duration modification is provided by deleting or replicating
some of the windowed segments. The pitch period modification, on the other hand, is
provided by increasing or decreasing the superposition between windowed segments.
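
The TD-PSOLA duration modification described above can be sketched as follows. This is a minimal illustration and not the cited authors' implementation: it assumes a constant pitch period so that the pitch marks fall at multiples of `period`, and all function names are hypothetical.

```python
import math

def hann(length):
    """Symmetric Hanning window; two copies offset by half the length sum to one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]

def windowed_segments(signal, period):
    """Cut one windowed segment per pitch mark, from the previous mark to the next.

    Assumption: constant pitch period, so marks sit at multiples of `period`.
    """
    size = 2 * period + 1
    window = hann(size)
    marks = range(period, len(signal) - period, period)
    return [[signal[m - period + i] * window[i] for i in range(size)]
            for m in marks]

def change_duration(segments, factor):
    """Replicate (factor > 1) or delete (factor < 1) segments uniformly."""
    if not segments:
        return []
    n_out = max(1, round(len(segments) * factor))
    last = len(segments) - 1
    return [segments[min(last, int(j / factor))] for j in range(n_out)]

def overlap_add(segments, period):
    """Superpose the segments with their centers one pitch period apart."""
    out = [0.0] * (period * (len(segments) + 1) + 1)
    for k, segment in enumerate(segments):
        for i, value in enumerate(segment):
            out[k * period + i] += value
    return out

period = 8
signal = [math.sin(2.0 * math.pi * n / period) for n in range(10 * period + 1)]
segments = windowed_segments(signal, period)
same = overlap_add(segments, period)                          # unmodified duration
longer = overlap_add(change_duration(segments, 2.0), period)  # roughly doubled
```

Without modification the overlapping windows sum to one, so the interior of the signal is reproduced. Note that `change_duration` replicates every segment regardless of its content, which is the behaviour the remainder of this description sets out to avoid.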
Despite the success achieved in many commercial TTS systems, the synthetic
speech produced by using the TD-PSOLA model of synthesis can present some drawbacks,
mainly under large prosodic variations, outlined as follows.

Examples of such PSOLA methods are those defined in documents EP-0363233,
U.S. Pat. No. 5,479,564 and EP-0706170. A specific example is also the MBR-PSOLA
method as published by T. Dutoit and H. Leich in Speech Communications, Elsevier
Publisher, November 1993. U.S. Pat. No. 5,479,564 suggests a means of modifying the
frequency of an audio signal with constant fundamental frequency by overlap-adding
short-term signals extracted from this signal. The length of the weighting windows used to obtain
the short-term signals is approximately equal to two times the period of the audio signal and
their position within the period can be set to any value (provided the time shift between
successive windows is equal to the period of the audio signal). Document U.S. Pat. No.
5,479,564 also describes a means of interpolating waveforms between segments to
concatenate so as to smooth out discontinuities. Such PSOLA methods make it possible to modify the
duration of a given speech signal. This is done by repeating or deleting pitch bells before an
overlap and add operation is performed for the speech synthesis. The information in a pitch
bell is not always suitable for repetition, for example in a plosive sound. It is a common disadvantage
of prior art PSOLA methods that artefacts are introduced this way. These artefacts can lead to
a metallic sound of the synthesized speech signal and can even seriously affect or destroy the
intelligibility of the synthesized signal.

The present invention therefore aims to provide an improved method for
processing of a speech signal.

The present invention provides a method, a computer program product and a
computer system for processing of a speech signal. In essence, the present invention makes it
possible to synthesize a natural sounding speech signal with improved intelligibility.

This is accomplished by classifying certain intervals contained in the original 
speech signal. In accordance with a preferred embodiment of the invention 'steady' and 
'dynamic' intervals are identified within the original speech signal. This classification needs 
to be performed only once. It is utilized for synthesizing a speech signal based on the original 
speech signal with a modified duration. 



WO 2004/027758 PCT/IB2003/003360


The present invention is based on the observation that the repetition of pitch
bells from dynamic intervals, as it is done in prior art PSOLA methods, introduces an
unintentional periodicity which leads to artefacts, such as a metallic sounding synthesized
signal, and to reduced or destroyed intelligibility.

In accordance with the present invention this problem is solved by restricting
the processing of pitch bells for the purpose of duration modification to pitch bells of steady
intervals of the original speech signal. In other words, duration modifications are only
performed on those speech intervals which can have different durations. This is true for the
middle of a vowel or a consonant like the /s/ sound. But there are cases where local events
occur that last less than a single period. These are sudden changes like the start of an
unvoiced plosive (/p/, /t/, /k/) or the ticks and clicks produced by the tongue and the mouth
(/b/, /d/, /g/, /l/, /m/, /n/, etc.). Periods containing these events are important for intelligibility
and should not be omitted by manipulation. Repeating them is also a problem since this
introduces artefacts that sound unnatural. Also the periods at the start of a transition from an
unvoiced sound to a vowel have local features that should not be made longer or shorter. To
avoid artefacts, all periods are marked with a special period class-type information. This
information is used to determine whether a period can be repeated or omitted. Hence, pitch
bells which are obtained by windowing of dynamic intervals of the original speech signal are
not repeated for duration modification. Pitch bells which are obtained from intervals which
are classified as dynamic and as being essential for the intelligibility are kept in the
synthesized signal in order to maintain intelligibility. Pitch bells which are obtained by
windowing of intervals of the original speech signal which are classified as dynamic but as
not being essential for intelligibility may or may not be deleted before performing the overlap
and add operation without seriously affecting the quality of the resulting synthesized speech
signal.

A preferred application of the present invention is for text-to-speech systems
which store a large number of natural speech recordings which are modified in the process of
text-to-speech synthesis.

In accordance with a preferred embodiment of the invention a raised cosine
window is used for the windowing of the speech signal. Preferably a sine window is used for
steady intervals containing unvoiced speech. The pitch bells obtained for such steady
intervals containing unvoiced speech are randomized in order to remove any unintended
periodicity which can be introduced in the process of duration modification.

In the following, preferred embodiments of the invention will be described in
greater detail by making reference to the drawings in which:

Fig. 1 is illustrative of a flow chart of a preferred embodiment of the present
invention.

Fig. 2 is illustrative of the synthesis of a speech signal based on an original
speech signal in accordance with an embodiment of the present invention.

Fig. 3 is a block diagram of an embodiment of a computer system of the
invention.



Fig. 1 shows a flow diagram to illustrate a preferred embodiment of a method
of the invention. In step 100 a recording of natural speech is provided. In step 102 intervals in
the natural speech recording are identified and classified. For the classification of the speech
intervals the following classification system is used in the example considered here:



— — silence 

. — unvoiced period

v — voiced period

p — crucial dynamic unvoiced period (should only be used once) 

b — crucial dynamic voiced period (should only be used once)

q — dynamic unvoiced period (may only be used once) 

c — dynamic voiced period (may only be used once) 



The two basic categories of speech intervals are 'steady' and 'dynamic' speech
intervals. A speech interval is classified as 'steady' when it has an essentially constant signal
characteristic for a consecutive number of at least two periods of the fundamental frequency
of the natural speech signal. In contrast, a speech interval of the original speech recording is
classified as 'dynamic' when its signal characteristic only occurs within one period of the
fundamental frequency.

In the classification system considered here the '.' and 'v' periods are steady
periods. The 'p', 'b', 'q' and 'c' periods are dynamic periods which are treated differently in
the subsequent processing.
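
The classification system and the repeat and omit rules stated above can be captured in a small data structure. The following is a sketch under stated assumptions: the names are hypothetical, the symbol used for silence is assumed for illustration, and the treatment of silence is left open because the description does not specify it.

```python
from enum import Enum

class PeriodClass(Enum):
    SILENCE = "-"            # symbol assumed here for illustration only
    UNVOICED = "."
    VOICED = "v"
    CRUCIAL_UNVOICED = "p"   # should be used exactly once
    CRUCIAL_VOICED = "b"     # should be used exactly once
    DYNAMIC_UNVOICED = "q"   # used at most once
    DYNAMIC_VOICED = "c"     # used at most once

STEADY = {PeriodClass.UNVOICED, PeriodClass.VOICED}
CRUCIAL = {PeriodClass.CRUCIAL_UNVOICED, PeriodClass.CRUCIAL_VOICED}
OPTIONAL = {PeriodClass.DYNAMIC_UNVOICED, PeriodClass.DYNAMIC_VOICED}

def is_dynamic(period_class):
    """True for the 'p', 'b', 'q' and 'c' classes."""
    return period_class in CRUCIAL or period_class in OPTIONAL

def may_repeat(period_class):
    """Only pitch bells from steady periods may be repeated."""
    return period_class in STEADY

def may_delete(period_class):
    """Steady periods and non-crucial dynamic periods may be omitted."""
    return period_class in STEADY or period_class in OPTIONAL

def parse_labels(text):
    """Turn a label string such as 'bcvv.q' into a list of PeriodClass values."""
    return [PeriodClass(symbol) for symbol in text]

labels = parse_labels("bcvv.q")
```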

In step 104 the natural speech signal is windowed to obtain pitch bells. 
Preferably the windowing is performed by means of a raised cosine window or with a sine 
window for the '.' periods. 
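
The two window shapes named above can be compared with a short sketch. The function names are hypothetical. The complementary-sum properties computed here are general properties of these windows at half overlap and are not stated in the description itself.

```python
import math

def raised_cosine_window(length):
    """Raised cosine (Hanning) window with zero end points."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]

def sine_window(length):
    """Half-period sine window; its square equals the raised cosine window."""
    return [math.sin(math.pi * i / (length - 1)) for i in range(length)]

period = 6
size = 2 * period + 1          # a pitch bell spans two periods
rc = raised_cosine_window(size)
sw = sine_window(size)

# Two bells placed one period apart: the raised cosine weights add up to one
# in amplitude, whereas the sine weights add up to one in power.
amplitude_sums = [rc[i] + rc[i + period] for i in range(period + 1)]
power_sums = [sw[i] ** 2 + sw[i + period] ** 2 for i in range(period + 1)]
```

For noise-like unvoiced speech, where neighbouring pitch bells are largely uncorrelated, a power-complementary window keeps the signal energy roughly constant across the overlap. That may be one reason for preferring the sine window there, though the description does not say so.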

In step 106 the pitch bells which are obtained for periods which are classified
as 'steady' are processed in order to modify the duration of the speech signal. This can be
done by repeating or deleting pitch bells to increase or decrease the original duration,
respectively. Pitch bells which are obtained from periods which are classified as 'dynamic'
are not repeated in order to avoid the introduction of artifacts. Pitch bells which have been
obtained from periods which are classified as 'p' or 'b' cannot be deleted, in order to
maintain the intelligibility of the original signal. Pitch bells which are obtained for periods
which are classified as 'q' or 'c' are also not repeated, but can be deleted without seriously
affecting the intelligibility of the resulting synthesized signal.
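
Step 106 can be sketched as a function that decides, from the period labels alone, which pitch bells are emitted and how often. This is an illustration under assumptions: the function name, the fractional `factor` parameter and the error-accumulation scheme for choosing which steady bells to repeat or drop are not taken from the description, and labels outside the listed classes (for example silence) are simply passed through once.

```python
STEADY = {".", "v"}      # may be repeated or deleted
CRUCIAL = {"p", "b"}     # kept exactly once
OPTIONAL = {"q", "c"}    # kept at most once, may be deleted

def plan_pitch_bells(labels, factor, drop_optional=False):
    """Return the indices of the pitch bells to overlap-add, in output order.

    labels:        one class symbol per period of the original signal
    factor:        duration factor applied to steady periods only
    drop_optional: delete 'q' and 'c' bells instead of keeping them once
    """
    plan = []
    credit = 0.0
    for index, label in enumerate(labels):
        if label in STEADY:
            # Accumulate the requested duration and emit whole bells from it.
            credit += factor
            while credit >= 1.0:
                plan.append(index)
                credit -= 1.0
        elif label in OPTIONAL and drop_optional:
            continue
        else:
            # Dynamic bells (and unlisted classes) are used once, never repeated.
            plan.append(index)
    return plan

labels = "bcvvvqc"
stretched = plan_pitch_bells(labels, 2.0)
shortened = plan_pitch_bells(labels, 0.5, drop_optional=True)
```

In `stretched` every 'v' index appears twice while the 'b', 'c' and 'q' indices appear once. In `shortened` the 'b' bell survives, the 'q' and 'c' bells are dropped and only part of the 'v' bells remain.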

Preferably pitch bells for periods which are classified as '.' are obtained in a
randomized way in order to avoid the introduction of periodicity. This is further helped by
the usage of a sine window for the windowing of those periods.
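
The description does not state how the randomization is carried out. One possible reading, sketched here purely as an assumption, is that when an unvoiced steady interval is lengthened its pitch bells are drawn in random order rather than repeated back to back. The function name and parameters are hypothetical.

```python
import random

def randomized_bell_order(n_bells, n_out, seed=None):
    """Pick n_out bell indices at random, avoiding the same bell twice in a row."""
    if n_bells < 1:
        raise ValueError("at least one pitch bell is required")
    rng = random.Random(seed)
    order = []
    previous = None
    for _ in range(n_out):
        choice = rng.randrange(n_bells)
        # With a single bell available, repetition cannot be avoided.
        while n_bells > 1 and choice == previous:
            choice = rng.randrange(n_bells)
        order.append(choice)
        previous = choice
    return order

order = randomized_bell_order(n_bells=5, n_out=12, seed=1)
```

Avoiding immediate repeats removes the shortest possible repetition period. Whether this matches the intended randomization is not confirmed by the text.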

In step 108 the processed pitch bells are overlapped and added in order to 
obtain the synthesized signal. 

Fig. 2 is illustrative of an example for the processing of a natural speech signal
200. The natural speech signal 200 has dynamic intervals 202, 204, 206, 208, 210 and 212.
The dynamic interval 202 contains periods which are classified as 'b' and 'c'. The dynamic
interval 204 contains periods which are classified as 'c' and 'q'. The dynamic interval 206
contains periods which are classified as 'q'. The dynamic interval 208 contains periods which
are classified as 'q', 'c' and 'b'. The dynamic interval 210 contains periods which are
classified as 'c' and 'b'. Finally the dynamic interval 212 contains periods which are classified as
'c' and 'b'. Further the natural speech signal 200 contains steady intervals 214, 216, 218,
220, 222 and 224. The steady interval 214 contains periods which are classified as 'v'; the
steady interval 216 contains periods which are classified as '.'; the steady interval 218
contains periods which are classified as '.'; the steady interval 220 contains periods which are
classified as '.'; the steady interval 222 contains periods which are classified as 'v'; and the
steady interval 224 contains periods which are classified as 'v'. This classification can be
performed either manually or automatically by means of an appropriate signal analysis
program. Preferably an automatic analysis is performed by means of such a program, which is
then checked by a human expert and manually corrected, if necessary. It is to be noted that
this classification needs to be performed only once in order to enable an unlimited number of
signal syntheses.

In the example considered here a signal with an extended duration as compared to the
original speech signal 200 is to be synthesized based on the natural speech signal 200.
For this purpose the natural speech signal 200 is windowed by means of a
window positioned synchronously with the fundamental frequency of the natural speech
signal 200, as is known as such from the prior art and used in PSOLA type methods.

Preferably a raised cosine is used as window. For periods which are classified
as '.' a sine window is used in order to reduce unintended periodicity which may be
introduced when pitch bells of the noisy signal portion are repeated. As a further measure
against unintended periodicity the pitch bells for the '.' classified periods are acquired in a
randomized way. In the example considered here the signal to be synthesized is composed as
follows in the domain of the time axis 226:

The first interval 228 of the speech signal to be synthesized contains the pitch
bells from the dynamic interval 202. These pitch bells are used for the interval 228 without
modification, which implies that the duration of the interval 228 is unchanged with respect to
the dynamic interval 202. The duration of the interval 230 is about twice the duration of the
corresponding steady interval 214. This is accomplished by repeating each of the pitch bells
acquired for the steady interval 214. Interval 232 contains the pitch bells from the dynamic
interval 204. The duration of 232 is unchanged as compared to the dynamic interval 204.
Interval 234 is constituted by pitch bells acquired from steady interval 216. Again each of the
pitch bells contained in the steady interval 216 is repeated in order to double the duration of
this interval. Likewise the following intervals 236, 238, 240, 242, ... are obtained from the
intervals 206, 218, 208, 220, 210, 222, 212, 224. Next the pitch bells are overlapped and added in the
domain of the time axis 226 in order to obtain the resulting synthesized signal. Alternatively
the pitch bells obtained from the periods of the natural speech signal 200 which are classified
as 'q' or 'c' can be deleted. In any case none of the pitch bells which are obtained from
periods of the natural speech signal 200 which are classified as 'dynamic' are repeated. This
way a duration modification can be performed without introducing artifacts which would
otherwise seriously impact the quality and intelligibility of the synthesized signal.
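
Because only the steady intervals are stretched, the overall duration factor is smaller than the repetition count applied to the steady pitch bells. As a worked relation (the symbols are introduced here for illustration and do not appear in the original), let T_s and T_d be the total durations of the steady and dynamic intervals of the original signal, r the number of times each steady pitch bell is used, and beta the resulting overall duration factor:

```latex
T' = T_d + r\,T_s, \qquad
\beta = \frac{T'}{T_s + T_d}, \qquad
r = \frac{\beta\,(T_s + T_d) - T_d}{T_s}
```

In the example of Fig. 2, r = 2. If, purely as an assumed figure, the steady intervals made up 80% of the original duration, the synthesized signal would be about 0.2 + 2 x 0.8 = 1.8 times as long as the original. Conversely, reaching an overall factor of 2 under the same assumption would require r = 2.25 on average. These durations are approximate, since neighbouring pitch bells overlap.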

In the example considered here 'p' is used to mark local (unvoiced) events that
are crucial for the intelligibility of the spoken utterance. Usually, the noise burst after the
release of air by the mouth or the tongue is of this type. The phonemes /p/, /t/ and /k/ have at
least one such period. Periods marked with 'p' should appear only once in the synthesized
speech, regardless of the final duration of the phoneme. Some local (unvoiced) events are not
crucial for intelligibility but are so dynamic that repeating them would introduce a series of
unnatural sounding periods. These periods are marked with the letter 'q'. They may only be
used once, but they can also be omitted without a major degradation in quality or
intelligibility. The voiced counterparts for 'p' and 'q' are the types denoted by 'b' and 'c'.
The voiced plosives /b/, /d/ and /g/ usually have at least one period marked with 'b'. Also the
tongue can produce tick and click sounds when it hits or leaves other parts of the mouth. The
phoneme /l/ is an example where this can happen. Transitions from silence to vowels or
from unvoiced consonants to vowels also have periods with local events. Although the
periods in the middle of a vowel can be repeated many times without affecting the
naturalness, the periods that fall right in the middle of the transition are too dynamic for
repetition.

Fig. 3 shows a block diagram of an embodiment of a computer system of the
invention. Preferably the computer system is a text-to-speech system which embodies the
principles of the present invention. The computer system 300 has a module 302 which serves
to store natural speech signals. Module 304 serves to automatically, manually or interactively
classify periods of the natural speech signals stored in the module 302. Module 306 serves to
perform the windowing of a natural speech signal stored in the module 302. This way a
number of pitch bells are obtained. Module 308 serves for pitch bell processing. The pitch
bell processing for duration modification is only performed on pitch bells which are obtained
from intervals which are classified as steady. In addition pitch bells from dynamic intervals
which are classified as not being essential for the intelligibility can be deleted by module 308,
such that they do not occur in the synthesized signal. Module 310 serves to perform an
overlap and add operation of the resulting pitch bells in order to obtain the synthesized signal.
The desired modification of the duration of the original natural speech signal stored in
module 302 is input into the computer system 300. The resulting synthesized signal is
output from the computer system 300 on a carrier wave or as a data file.
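
The interplay of modules 306, 308 and 310 can be sketched end to end as follows. This is an illustration under simplifying assumptions and not the claimed system: the pitch period is taken as constant, the period labels (the output of module 304) are supplied by the caller, one label is assumed per pitch bell, and all function names are hypothetical.

```python
import math

STEADY = {".", "v"}      # duration modification is applied to these only
OPTIONAL = {"q", "c"}    # dynamic, not essential: may be deleted

def window_signal(signal, period):
    """Module 306: one raised cosine pitch bell of two periods per pitch mark."""
    size = 2 * period + 1
    win = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (size - 1)) for i in range(size)]
    marks = range(period, len(signal) - period, period)
    return [[signal[m - period + i] * win[i] for i in range(size)] for m in marks]

def process_bells(bells, labels, factor, drop_optional=False):
    """Module 308: repeat or delete steady bells; never repeat dynamic bells."""
    out, credit = [], 0.0
    for bell, label in zip(bells, labels):
        if label in STEADY:
            credit += factor
            while credit >= 1.0:
                out.append(bell)
                credit -= 1.0
        elif not (drop_optional and label in OPTIONAL):
            out.append(bell)
    return out

def overlap_add(bells, period):
    """Module 310: superpose the bells one period apart."""
    if not bells:
        return []
    out = [0.0] * (period * (len(bells) + 1) + 1)
    for k, bell in enumerate(bells):
        for i, value in enumerate(bell):
            out[k * period + i] += value
    return out

def synthesize(signal, labels, period, factor, drop_optional=False):
    """Run windowing, pitch bell processing and overlap-add in sequence."""
    bells = window_signal(signal, period)
    if len(labels) != len(bells):
        raise ValueError("expected one label per pitch bell")
    return overlap_add(process_bells(bells, labels, factor, drop_optional), period)

period = 4
signal = [float(n % 5) for n in range(8 * period + 1)]   # stand-in for module 302
labels = "bvvvqvc"
unchanged = synthesize(signal, labels, period, 1.0)
longer = synthesize(signal, labels, period, 2.0)
shorter = synthesize(signal, labels, period, 0.5, drop_optional=True)
```

With a factor of 1.0 the interior of the signal is reproduced. With 2.0 each 'v' bell is used twice while the 'b', 'q' and 'c' bells appear once.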

LIST OF REFERENCE NUMERALS:

200 natural speech signal
202 dynamic interval
204 dynamic interval
206 dynamic interval
208 dynamic interval
214 steady interval